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Abstract. We describe a parallel iterative least squares solver named LSRN that is based on 
random normal projection. LSRN computes the min-length solution to min^gRn \\Ax — fc> 1 1 2 , where 
A £ Jj m x™ with m 2> n or m <C n, and where A may be rank-deficient. Tikhonov regularization may 
also be included. Since A is only involved in matrix-matrix and matrix-vector multiplications, it can 
be a dense or sparse matrix or a linear operator, and LSRN automatically speeds up when A is sparse 
or a fast linear operator. The preconditioning phase consists of a random normal projection, which 
is embarrassingly parallel, and a singular value decomposition of size [7miii(m,n)] X min(m,n), 
where 7 is moderately larger than 1, e.g., 7 = 2. We prove that the preconditioned system is 
well-conditioned, with a strong concentration result on the extreme singular values, and hence that 
the number of iterations is fully predictable when we apply LSQR or the Chebyshev semi-iterative 
method. As we demonstrate, the Chebyshev method is particularly efficient for solving large problems 
on clusters with high communication cost. Numerical results demonstrate that on a shared-memory 
machine, LSRN outperforms LAPACK's DGELSD on large dense problems, and MATLAB's backslash 
(SuiteSparseQR) on sparse problems. Further experiments demonstrate that LSRN scales well on an 
Amazon Elastic Compute Cloud cluster. 
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1. Introduction. Randomized algorithms have become indispensable in many 
areas of computer science, with applications ranging from complexity theory to com- 
binatorial optimization, cryptography, and machine learning. Randomization has also 
been used in numerical linear algebra (for instance, the initial vector in the power iter- 
ation is chosen at random so that almost surely it has a nonzero component along the 
direction of the dominant eigenvector), yet most well-developed matrix algorithms, 
e.g., matrix factorizations and linear solvers, arc deterministic. In recent years, how- 
ever, motivated by large data problems, very nontrivial randomized algorithms for 
very large matrix problems have drawn considerable attention from researchers, orig- 
inally in theoretical computer science and subsequently in numerical linear algebra 
and scientific computing. By randomized algorithms, we refer in particular to ran- 
dom sampling and random projection algorithms [3 H21 [5J HTJ [T] . For a comprehensive 
overview of these developments, see the review of Mahoney [17], and for an excellent 
overview of numerical aspects of coupling randomization with classical low-rank ma- 
trix factorization methods, see the review of Halko, Martinsson, and Tropp [13]. 
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Here, we consider high-precision solving of linear least squares (LS) problems that 
are strongly over- or under-determined, and possibly rank-deficient. In particular, 
given a matrix A £ ]g™ xn an d a vector b € M m , where m ^> n or m <C n and we 
do not assume that A has full rank, we wish to develop randomized algorithms to 
compute accurately the unique min-length solution to the problem 

minimize xe Rn H^Aar — (1-1) 

If we let r = rank(A) < min(m, n), then recall that if r < n (the LS problem is 



under-determined or rank-deficient), then (1.1) has an infinite number of minimizers 



In that case, the set of all minimizers is convex and hence has a unique element having 
minimum length. On the other hand, if r = n so the problem has full rank, there 



exists only one minimizer to (1.1 1 and hence it must have the minimum length. In 
either case, we denote this unique min-length solution to ( |1.1[ ) by x*. That is, 

x* = argmin ||a;||2 subject to x G argmin \\Az — b\\%. (1-2) 

z 

LS problems of this form have a long history, tracing back to Gauss, and they arise 
in numerous applications. The demand for faster LS solvers will continue to grow in 
light of new data applications and as problem scales become larger and larger. 

In this paper, we describe an LS solver called LSRN for these strongly over- or 
under-determined, and possibly rank-deficient, systems. LSRN uses random normal 
projections to compute a preconditioner matrix such that the preconditioned system 
is provably extremely well-conditioned. Importantly for large-scale applications, the 
preconditioning process is embarrassingly parallel, and it automatically speeds up 
with sparse matrices and fast linear operators. LSQR [20] or the Chebyshev semi- 
iterative (CS) method [11] can be used at the iterative step to compute the min-length 
solution within just a few iterations. We show that the latter method is preferred on 
clusters with high communication cost. 

Because of its provably-good conditioning properties, LSRN has a fully predictable 
run-time performance, just like direct solvers, and it scales well in parallel environ- 
ments. On large dense systems, LSRN is faster than LAPACK's DGELSD for strongly 
over-determined problems, and is much faster for strongly under-determined prob- 
lems, although solvers using fast random projections, like Blendenpik [T], are still 
slightly faster in both cases. On sparse systems, LSRN runs significantly faster than 
competing solvers, for both the strongly over- or under-determined cases. 

In section [2] we describe existing deterministic LS solvers and recent randomized 
algorithms for the LS problem. In section [3] we show how to do preconditioning cor- 
rectly for rank-deficient LS problems, and in section [4] we introduce LSRN and discuss 
its properties. Section [5] describes how LSRN can handle Tikhonov regularization for 
both over- and under-determined systems, and in section [6] we provide a detailed 
empirical evaluation illustrating the behavior of LSRN. 

2. Least squares solvers. In this section we discuss related work, including 
deterministic direct and iterative methods as well as recently developed randomized 
methods, for computing solutions to LS problems, and we discuss how our results fit 
into this broader context. 



2.1. Deterministic methods. It is well known that x* in (1.2) can be com- 
puted using the singular value decomposition (SVD) of A. Let A = UT,V T be 
the economy-sized SVD, where U G R mxr , S G R rxr , and V G W lXr . We have 
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x* = VT, U T b. The matrix V r S _1 ?7 T is the Moore- Penrose pseudoinverse of A, de- 
noted by A^ . The pseudoinverse is defined and unique for any matrix. Hence we can 
simply write x* = A'b. The SVD approach is accurate and robust to rank-deficiency. 



Another way to solve ( 1.2 ) is using a complete orthogonal factorization of A. If we 
can find orthonormal matrices Q € M. mxr and Z 6 IR™ xr , and a matrix T £ W xr , such 
that A = QTZ T , then the min-length solution is given by x* = ZT^ 1 Q T b. We can 
treat SVD as a special case of complete orthogonal factorization. In practice, complete 
orthogonal factorization is usually computed via rank-revealing QR factorizations, 
making T a triangular matrix. The QR approach is less expensive than SVD, but it 
is slightly less robust at determining the rank of A. 



A third way to solve ( 1.2 ) is by computing the min-length solution to the normal 



equation A Ax = A b, namely 

x* = (A T A) J< A T b = A T {AAyb. (2.1) 



It is easy to verify the correctness of ( J2 . 1 [ ) by replacing A by its economy-sized SVD 
UTjV t . If r — mm(m,n), a Cholesky factorization of either A T A (if m > n) or AA T 



(if m < n) solves (2.1) nicely. If r < min(r7i,n), we need the eigensystem of A A or 
AA T to compute x* . The normal equation approach is the least expensive among the 
three direct approaches we have mentioned, especially when m»norm«n, but it 
is also the least accurate one, especially on ill-conditioned problems. See Chapter 5 
of Golub and Van Loan [TU] for a detailed analysis. 



Instead of these direct methods, we can use iterative methods to solve (1.1). If 
all the iterates {x^} are in range(^4 T ) and if {x^} converges to a minimizer, it 



must be the minimizer having minimum length, i.e., the solution to (1.2). This is 
the case when we use a Krylov subspace method starting with a zero vector. For 
example, the conjugate gradient (CG) method on the normal equation leads to the 
min-length solution (see Paige and Saunders [H]). In practice, CGLS [IS] . LSQR [20] 
are preferable because they are equivalent to applying CG to the normal equation 
in exact arithmetic but they are numerically more stable. Other Krylov subspace 
methods such as the CS method [TT] and LSMR [9] can solve ( |1.1[ ) as well. 

Importantly, however, it is in general hard to predict the number of iterations for 
CG-like methods. The convergence rate is affected by the condition number of A T A. 
A classical result |T6l p. 187] states that 



\A r A 



<2 ( M^)-A k (22) 

\\xM-x*\\ ATA - \^(A^Aj+l) ' 1 • ' 

where H^H^t^ = z T A T Az = \\Az\\ 2 for any z £ R n , and where k(A t A) is the condition 
number of A 1 A under the 2-norm. Estimating k{A t A) is generally as hard as solving 
the LS problem itself, and in practice the bound does not hold in any case unless 
reorthogonalization is used. Thus, the computational cost of CG-like methods remains 
unpredictable in general, except when A T A is very well-conditioned and the condition 
number can be well estimated. 

2.2. Randomized methods. In 2007, Drineas, Mahoney, Muthukrishnan, and 
Sarlos [8] introduced two randomized algorithms for the LS problem, each of which 
computes a relative-error approximation to the min-length solution in 0{mn log n) 
time, when m n. Both of these algorithms apply a randomized Hadamard trans- 
form to the columns of A, thereby generating a problem of smaller size, one using 
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uniformly random sampling and the other using a sparse random projection. They 
proved that, in both cases, the solution to the smaller problem leads to relative-error 
approximations of the original problem. The accuracy of the approximate solution de- 
pends on the sample size; and to have relative precision e, one should sample 0(n/e) 
rows after the randomized Hadamard transform. This is suitable when low accuracy 
is acceptable, but the e dependence quickly becomes the bottleneck otherwise. Using 
those algorithms as preconditioners was also mentioned in [5]. This work laid the 
ground for later algorithms and implementations. 

Later, in 2008, Rokhlin and Tygert [21] described a related randomized algorithm 
for over-determined systems. They used a randomized transform named SRFT that 
consists of m random Givens rotations, a random diagonal scaling, a discrete Fourier 
transform, and a random sampling. They considered using their method as a precon- 
ditioning method, and they showed that to get relative precision e, only C(nlog(l/e)) 
samples are needed. In addition, they proved that if the sample size is greater than 
4n 2 , the condition number of the preconditioned system is bounded above by a con- 
stant. Although choosing this many samples would adversely affect the running time 
of their solver, they also illustrated examples of input matrices for which the An 2 
sample bound was weak and for which many fewer samples sufficed. 

Then, in 2010, Avron, Maymounkov, and Toledo [1] implemented a high-precision 
LS solver, called Blendenpik, and compared it to LAPACK's DGELS and to LSQR 
with no preconditioning. Blendenpik uses a Walsh-Hadamard transform, a discrete 
cosine transform, or a discrete Hartley transform for blending the rows/columns, 
followed by a random sampling, to generate a problem of smaller size. The R factor 
from the QR factorization of the smaller matrix is used as the preconditioner for 
LSQR. Based on their analysis, the condition number of the preconditioned system 
depends on the coherence or statistical leverage scores of A, i.e., the maximal row 
norm of U, where U is an orthonormal basis of range(A). We note that a solver for 
under-determined problems is also included in the Blendenpik package. 

In 2011, Coakley, Rokhlin, and Tygert [5] described an algorithm that is also based 
on random normal projections. It computes the orthogonal projection of any vector 
b onto the null space of A or onto the row space of A via a preconditioned normal 
equation. The algorithm solves the over-determined LS problem as an intermediate 
step. They show that the normal equation is well-conditioned and hence the solution 
is reliable. For an over-determined problem of size m x n, the algorithm requires 
applying A or A T 3n + 6 times, while LSRN needs approximately 2n + 200 matrix- 
vector multiplications under the default setting. Asymptotically, LSRN will become 
faster as n increases beyond several hundred. See section |4.3| for further complexity 
analysis of LSRN. 

2.3. Relationship with our contributions. All prior approaches assume that 
A has full rank, and for those based on iterative solvers, none provides a tight upper 
bound on the condition number of the preconditioned system (and hence the number of 



iterations). For LSRN, Theorem 3.2 ensures that the min-length solution is preserved, 
independent of the rank, and Theorems |4.4| and |4.5| provide bounds on the condition 
number and number of iterations, independent of the spectrum of A. In addition 
to handling rank-deficiency well, LSRN can even take advantage of it, resulting in a 
smaller condition number and fewer iterations. 

Some prior work on the LS problem has explored "fast" randomized transforms 
that run in roughly O(mn\ogm) time on a dense matrix A, while the random normal 
projection we use in LSRN takes 0(mn 2 ) time. Although this could be an issue for some 
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applications, the use of random normal projections comes with several advantages. 
First, if A is a sparse matrix or a linear operator, which is common in large-scale 
applications, then the Hadamard-based fast transforms are no longer "fast" . Second, 
the random normal projection is easy to implement using threads or MPI, and it scales 
well in parallel environments. Third, the strong symmetry of the standard normal 
distribution helps give the strong high probability bounds on the condition number 
in terms of sample size. These bounds depend on nothing but s/r, where s is the 
sample size. For example, if s — 4r, Theorem |4.4| ensures that, with high probability, 
the condition number of the preconditioned system is less than 3. 

This last property about the condition number of the preconditioned system 
makes the number of iterations and thus the running time of LSRN fully predictable 
like for a direct method. It also enables use of the CS method, which needs only one 
level- 1 and two level-2 BLAS operations per iteration, and is particularly suitable for 
clusters with high communication cost because it doesn't have vector inner products 
that require synchronization between nodes. Although the CS method has the same 
theoretical upper bound on the convergence rate as CG-like methods, it requires ac- 
curate bounds on the singular values in order to work efficiently Such bounds are 
generally hard to come by, limiting the popularity of the CS method in practice, 
but they are provided for the preconditioned system by our Theorem |4.4[ and we do 
achieve high efficiency in our experiments. 



3. Preconditioning for linear least squares. In light of (2.2), much effort 
has been made to transform a linear system into an equivalent system with reduced 
condition number. This preconditioning, for a square linear system Bx = d of full 
rank, usually takes one of the following forms: 

left preconditioning M T Bx = M T d, 
right preconditioning BNy = d, x = Ny, 
left and right preconditioning M T BNy = M T d, x = Ny. 

Clearly, the preconditioned system is consistent with the original one, i.e., has the 
same x* as the unique solution, if the preconditioners M and N are nonsingular. 



For the general LS problem (1.2), preconditioning needs better handling in order 
to produce the same min-length solution as the original problem. For example, if 
we apply left preconditioning to the LS problem min x \\Ax — b\\2, the preconditioned 
system becomes min^ ||M T Ax — M T b\\2, and its min-length solution is given by 

x* lcft = (M T A)^M T b. 

Similarly, the min-length solution to the right preconditioned system is given by 

< ight = N(AN)h. 

The following lemma states the necessary and sufficient conditions for A^ = N(ANy 
or A^ ~ [M T A)^M T to hold. Note that these conditions holding certainly imply that 
x right = x * ano - x \eh = x * ' respectively. 

Lemma 3.1. Given A e E mxn ; N € M. nxp and M e M. mx i, we have 

1. A^ = N(ANY if and only if range{NN T A T ) = range{A T ), 

2. A^ = (M T A)^M T if and only if range(MM T A) = range(A) 
Proof. Let r = rank(A) and U"EV T be A's economy-sized SVD as in section 



2.1 



with A^ = VYj 1 U t . Before continuing our proof, we reference the following facts 
about the pseudoinverse: 
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1. = B T {BB T Y for any matrix B, 

2. For any matrices B and C such that BC is defined, (BCY — C^B^ if 
(i) B T B = I or (ii) CC T — I or (iii) B has full column rank and C has 
full row rank. 

Now let's prove the "if" part of the first statement. If range(iV N T A T ) = range(^4 T ) = 
range(V), we can write NN T A T as VZ where Z has full row rank. Then, 

N{AN)^ = N(AN) T (AN(AN)y = NN T A T {ANN T A T )^ 

= V\Z(i7EV T V.Z) t = VZ{UY>Z)^ = VZZ^Yr x XJ T = VTr l U T = A*. 

Conversely, if N{ANy — A\ we know that range (N(AN)^) = range(yl t ) = range(F) 
and hence range (V) C range(TV). Then we can decompose N as (V V c )(^j = 
VZ + V C Z C , where V c is orthonormal, V T V C = 0, and (§) ^ as row raiui - Then, 

= N{AN)^ -A^ = {VZ + V C Z C )(UEV T (VZ + V C Z C ))^ - V^U 7 
= {VZ + V c Z c )(UEZy - VTr 1 ^ 
= {VZ + V c Z c )Z t E" 1 J7 T - VT,- 1 ^ = V c Z c Z^Yr l U T , 

Multiplying by V C T on the left and LT on the right, we get Z C Z^ = 0, which is 
equivalent to Z C Z T = 0. Therefore, 

range(AW T A T ) = range((VZ + V C Z C ){VZ + V C Z C ) T VYAJ T ) 
= mnge({VZZ T V T + V c Z c Z T c Vj)VYAJ T ) 
= range (VZZ T EJ7 T ) 
= range(F) = range(-A T ), 

where we used the facts that Z has full row rank and hence ZZ T is nonsingular, £ is 
nonsingular, and U has full column rank. 

To prove the second statement, let us take B — A T . By the first statement, wc 
know B^ — M(BMy if and only if range(MAf T i? T ) = range(i? T ), which is equivalent 
to saying A^ = {M T A)^M T if and only if range(MM T A) = range(A). □ 

Although Lemma |3.1| gives the necessary and sufficient condition, it does not 
serve as a practical guide for preconditioning LS problems. In this work, we are more 
interested in a sufficient condition that can help us build preconditioners. To that 
end, we provide the following theorem. 

Theorem 3.2. Given A e R mxn , b e R m , N e R nxp , and M e R mx i, let x* be 
the min-length solution to the LS problem min x \\Ax — b\\2, x* ight — Ny* where y* is 
the min-length solution to min y \\ANy — b\\2, and x* le f t be the min-length solution to 
min^ 1 1 M T Ax - M T b || 2 . Then, 

L x Ught = x * if range{N) = range{A T ), 
2. x* le j t — x* if range{M) = range{A) . 

Proof. Let r — rank(^4) and t/EV rT be A's economy-sized SVD. If range(iV) = 
range(A T ) = range(V), we can write N as VZ, where Z has full row rank. Therefore, 

range {NN T A T ) = r&nge{VZZ T V T VY,U T ) = r&nge{V Z Z T ZU T ) 
= range(y) = range(A T ). 

By Lemma [3.l| A^ — N(AN)^ and hence x* eit = x* . The second statement can be 
proved by similar arguments. □ 
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4. Algorithm LSRN. In this section we present LSRN, an iterative solver for 
solving strongly over- or under-determined systems, based on "random normal pro- 
jection" . To construct a preconditioner we apply a transformation matrix whose en- 
tries are independent random variables drawn from the standard normal distribution. 
We prove that the preconditioned system is almost surely consistent with the original 
system, i.e., both have the same min-length solution. At least as importantly, we 
prove that the spectrum of the preconditioned system is independent of the spectrum 
of the original system; and we provide a strong concentration result on the extreme 
singular values of the preconditioned system. This concentration result enables us to 
predict the number of iterations for CG-like methods, and it also enables use of the CS 
method, which requires an accurate bound on the singular values to work efficiently 

4.1. The algorithm. Algorithm [l] shows the detailed procedure of LSRN to com- 
pute the min-length solution to a strongly over-determined problem, and Algorithm 
[2] shows the detailed procedure for a strongly under-determined problem. We refer 
to these two algorithms together as LSRN. Note that they only use the input matrix 
A for matrix-vector and matrix-matrix multiplications, and thus A can be a dense 
matrix, a sparse matrix, or a linear operator. In the remainder of this section we 
focus on analysis of the over-determined case. We emphasize that analysis of the 
under-determined case is quite analogous. 

Algorithm 1 LSRN (computes x « A^b when m 3> n) 
1: Choose an oversampling factor 7 > 1 and set s = pyn.~|. 

2: Generate G = randn(s, m), i.e., an s-by-m random matrix whose entries are 

independent random variables following the standard normal distribution. 
3: Compute A = GA. 

4: Compute A's economy-sized SVD UY,V T , where r = rank(A), U e R sxr , S G 

R rxr , V e R nxr , and only £ and V are needed. 
5: Let N = VTT 1 . 

6: Compute the min-length solution to min y || ANy — b\\2 using an iterative method. 

Denote the solution by y. 
7: Return x = Ny. 



Algorithm 2 LSRN (computes x w A^b when m <§; n) 
1: Choose an oversampling 7 > 1 and set s = [7m] . 

2: Generate G = randn(n, s), i.e., an n-by-s random matrix whose entries are inde- 
pendent random variables following the standard normal distribution. 
3: Compute A = AG. 

4: Compute A's economy-sized SVD Ut>V T , where r = rank(A), U G R nxr , S g 

R rxr , V e R sxr , and only U and E are needed. 
5: Let M = UY,- 1 . 

6: Compute the min-length solution to min^ ||M T Ar — M T b\\ 2 using an iterative 

method, denoted by x. 
7: Return x. 



4.2. Theoretical properties. The use of random normal projection offers LSRN 
some nice theoretical properties. We start with consistency. 

Theorem 4.1. In AlgorithmU^ we have x = A^b almost surely. 
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Proof. Let r = rank(A) and UYiV T be A's economy-sized SVD. We have 

range(iV) = range(V£ _1 ) = range(V) 

= range(i r ) = range(A T G T ) = range(V£(G?7) T ). 

Define G\ = GU £ R sxr . Since G's entries are independent random variables fol- 
lowing the standard normal distribution and U is orthonormal, G±s entries are also 
independent random variables following the standard normal distribution. Then given 
s > jn > n > r, we know G\ has full column rank r with probability 1. Therefore, 

range(V) = range(FEG^) = range(V) = range(A T ), 



and hence by Theorem 3.2 we have x = A^b almost surely. □ 

A more interesting property of LSRN is that the spectrum (the set of singular 
values) of the preconditioned system is solely associated with a random matrix of size 
s x r, independent of the spectrum of the original system. 

Lemma 4.2. In Algorithm^ the spectrum of AN is the same as the spectrum of 
G\ = (GUY , independent of A's spectrum. 

Proof. Following the proof of Theorem 4.1 let Gi = U\Y,\Vi be Gi's economy- 
sized SVD, where U x £ R sxr , Ex G M rxr , and V x £ R rxr . Since range(C/) = 
range(GA) = range(GCZ) = range([/i) and both U and U\ are orthonormal matri- 
ces, there exists an orthonormal matrix Q\ £ W xr such that U\ — UQ\. As a result, 

UtV T = A= GUYV T = J7 1 E 1 V 1 T I]V T = UQ^V?^. 

Multiplying by U T on the left of each side, we get T,V T = QiSi V 1 T EV T . Taking the 
pseudoinverse gives N = VTT 1 = V^V^ 1 Q\. Thus, 

AN = UYV T VYr 1 VxY^ 1 Q T l = UV^^Ql, 

which gives AN's SVD. Therefore, AN's singular values are diag(E 1 " 1 ), the same as 
G}'s spectrum, but independent of A's. □ 

We know that Gi = GU is a random matrix whose entries are independent 
random variables following the standard normal distribution. The spectrum of G\ is 
a well-studied problem in Random Matrix Theory, and in particular the properties of 
extreme singular values have been studied. Thus, the following lemma is important 
for us. We use V(-) to refer to the probability that a given event occurs. 

Lemma 4.3. (Davidson and Szarek [3]) Consider an s x r random matrix G\ with 
s > r, whose entries are independent random variables following the standard normal 
distribution. Let the singular values be o~\ > • • • > a r . Then for any t > 0, 

maxjTVi > y/s + ^/r + t),V{a T < y/s - y/r - t)} < e' t2/2 . (4.1) 



With the aid of Lemma |4.3[ it is straightforward to obtain the concentration 
result of ai(AN), a r (AN), and k(AN) as follows. 

Theorem 4.4. In Algorithmui for any a G (0, 1 — y/rjs), we have 



\V (<n(AN) > ,v(a r (AN) < fa-^ ) } 



< e 



-cts/2 



(4.2) 



vL(AN) = °-±m< 1 + a+ ^ 



a r {AN) ~ l - a - y/r/s 



> 1 - 2e- a2s/2 . 



(4.3) 
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Proof. Set t = a^fs in Lemma 4.3 



□ 



In order to estimate the number of iterations for CG-like methods, we can now 



combine (2.2) and (4.3) 



Theorem 4.5. In exact arithmetic, given a tolerance e > 0, a CG-like method 
applied to the preconditioned system mm. y \\ANy-b\\ 2 with =0 converges within 
(loge — log2)/log(a + \Jr/s) iterations in the sense that 

\\Vcg - V*\\(AN) T (AN) < £ h*\\(AN) T {AN) ( 4 - 4 ) 

holds with probability at least 1 — 2e~ a s / 2 for any a £ (0, 1 — \fsjr), where y C c 
is the approximate solution returned by the CG-like solver and y* = (ANyb. Let 
ice, = Ny CG be the approximate solution to the original problem. Since x* = Ny* , 



(4.4) is equivalent to 



\ A r A <s\\x*\\ A T A , (4.5) 



or in terms of residuals, 



<£||6-r*|| 2 , (4.6) 



where f CG = b — Ax CG and r* — b — Ax* . 

In addition to allowing us to bound the number of iterations for CG-like methods, 



the result given by (4.2 ) also allows us to use the CS method. This method needs only 
one level-1 and two level-2 BLAS operations per iteration; and, importantly, because 
it doesn't have vector inner products that require synchronization between nodes, this 
method is suitable for clusters with high communication cost. It does need an explicit 
bound on the singular values, but once that bound is tight, the CS method has the 
same theoretical upper bound on the convergence rate as other CG-like methods. 
Unfortunately, in many cases, it is hard to obtain such an accurate bound, which 



prevents the CS method becoming popular in practice. In our case, however, (4.2) 
provides a probabilistic bound with very high confidence. Hence, we can employ 
the CS method without difficulty. For completeness, Algorithm [3] describes the CS 
method we implemented for solving LS problems. For discussion of its variations, see 
Gutknecht and Rollin [L2] . 

4.3. Running time complexity. In this section, we discuss the running time 
complexity of LSRN. Let's first calculate the computational cost of LSRN (Algorithm [I]) 
in terms of floating-point operations (flops). Note that we need only E and V but 
not U or a full SVD of A in step 4 of Algorithm [l] In step 6, we assume that the 
dominant cost per iteration is the cost of applying AN and (AN) T . Then the total 
cost is given by 

sm x flops(randn) for generating G 

+ s x &ops(A T u) for computing A 

+ 2sn 2 + lln 3 for computing S and V pH P- 254] 

+ A^tcr x (flops(Aw) + flops(A T u) + 4w) for solving min || A/Vy — 6|| 2 , 

v 



where lower-order terms are ignored. Here, flops(randn) is the average flop count 
to generate a sample from the standard normal distribution, while flops(Av) and 



10 



XIANGRUI MENG, MICHAEL SAUNDERS, AND MICHAEL MAHONEY 



Algorithm 3 Chebyshev semi-iterative (CS) method (computes x m A^b) 



l: Given A € 



6 g 



and a tolerance e > 0, choose < <jl < &u such 



that all non-zero singular values of A are in [ol, ajj] and let d = (afj 



■a 2 L )/2 and 



c=(afj 
Let x — 0, v — 0, and r = b. 

(log e- log 2) /log 



for k — 0,1, ... , 


(ac/2) 2 
v <— j3v + A T r 
i <- i + ra 
7' <— r — aAv 
end for 



if k = 0, 
if fc= 1, 
otherwise, 



1/d 



do 



V(2d) 



l/(d - ac 2 /4) 



if fc = 0, 
if k = 1, 
otherwise. 



flops(A T u) are the flop counts for the respective matrix-vector products. If A is a 
dense matrix, then we have flops(Au) = flops(yl T u) = 2mn. Hence, the total cost 
becomes 

flops (LSRNdcnso) = sm flops(randn) + 2smn + 2sn 2 + lln 3 + Mter x (4mn + 4nr). 

Comparing this with the SVD approach, which uses 2mn 2 + lln 3 flops, we find LSRN 
requires more flops, even if we only consider computing A and its SVD. However, 
the actual running time is not fully characterized by the number of flops. A matrix- 
matrix multiplication is much faster than an SVD with the same number of flops. 
We empirically compare the running time in Section [6] If A is a sparse matrix, we 
generally have Hops(Av) and flops(A T M) of order 0(m). In this case, LSRN should run 
considerably faster than the SVD approach. Finally, if A is an operator, it is hard 
to apply SVD, while LSRN still works without any modification. If we set 7 = 2 and 



e = 10 -14 , we know iVi tC r ~ 100 by Theorem 4.5 and hence LSRN needs approximately 
2n + 200 matrix- vector multiplications. 

One advantage of LSRN is that the stages of generating G and computing A = GA 
are embarrassingly parallel. Thus, it is easy to implement LSRN in parallel. For exam- 
ple, on a shared-memory machine using p cores, the total running time decreases to 

r LSRN = Trandn/P + Tmult/p + T^ P + T itcI /p, (4.7) 

where T ran d n , T mu u, and Titer are the running times for the respective stages if LSRN 
runs on a single core, T™^ p is the running time of SVD using p cores, and commu- 
nication cost among threads is ignored. Hence, multi-threaded LSRN has very good 
scalability with near-linear speedup. 

Alternatively, let us consider a cluster of size p using MPI, where each node stores 
a portion of rows of A (with m>n). Each node can generate random samples and 
do the multiplication independently, and then an MPFReduce operation is needed to 
obtain A. Since n is small, the SVD of A and the preconditioner N are computed on a 
single node and distributed to all the other nodes via an MPLBcast operation. If the 
CS method is chosen as the iterative solver, we need one MPLAllreduce operation per 
iteration in order to apply A T . Note that all the MPI operations that LSRN uses are 
collective. If we assume the cluster is homogeneous and has perfect load balancing, 
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the time complexity to perform a collective operation should be O(\ogp). Hence the 
total running time becomes 

= T randn /p + T mult /p + T svd + T itcI / P + (d + C 2 N itei )0(logp), (4.8) 

where C\ corresponds to the cost of computing A and broadcasting N, and C'2 cor- 
responds to the cost of applying A T at each iteration. Therefore, the MPI implemen- 
tation of LSRN still has good scalability as long as T sv d is not dominant, i.e., as long 
as A is not too big. Typical values of n (or m for under-determined problems) in our 
empirical evaluations are around 1000, and thus this is the case. 

5. Tikhonov regularization. We point out that it is easy to extend LSRN to 
handle certain types of Tikhonov regularization, also known as ridge regression. Recall 
that Tikhonov regularization involves solving the problem 



1 



minimize 



\\Ax-b\\ 2 2 + -\\Wx\ 



2 

2' 



(5.1) 



where W £ R nxn controls the regularization term. In many cases, W is chosen as 
XI n for some value of a regularization parameter A > 0. It is easy to see that (5.1 ) is 
equivalent to the following LS problem, without any regularization: 



1 



minimize 



(5.2) 



This is an over-determined problem of size (to + n) x n. If m 3> n, then we certainly 
have TO + n 3> n. Therefore, if m 3> n, we can directly apply LSRN to (5.2 1 in order to 



solve (5.1 ). On the other hand, if to <C n, then although (5.2) is still over-determined 



it is "nearly square," in the sense that to + n is only slightly larger than n. In this 
regime, random sampling methods and random projection methods like LSRN do not 



perform well. In order to deal with this regime, note that (5.1 1 is equivalent to 



minimize 



subject to Ax + r = 6, 



Wx 



where r = b — Ax is the residual vector. (Note that we use r to denote the matrix 
rank in a scalar context and the residual vector in a vector context.) By introducing 
z = Wx and assuming that W is non-singular, we can re-write the above problem as 



1 



minimize 



subject to (AW^ 1 I„ 



= b, 



i.e., as computing the min-length solution to 



(AW- 1 I„ 



= b. 



(5.3) 



Note that (5.3) is an under-determined problem of size to x (to + n). Hence, if 
to <C n, we have to <C to + n and we can use LSRN to compute the min-length solution 
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Table 6.1 
LS solvers and their properties. 



solver 


min-len solution to 


takin 


g advantage of 


under-det? 


rank-def? 


sparse 


A 


operator A 


LAPACK's DGELSD 


yes 


yes 


no 




no 


MATLAB's backslash 


no 


no 


yes 




no 


Blcndcnpik 


yes 


no 


no 




no 


LSRN 


yes 


yes 


yes 




yes 



to (5.3 1, denoted by (Z*h The solution to the original problem (5.1) is then given 

by x* — W~ 1 z* . Here, we assume that W~ x is easy to apply, as is the case when 
W = XI n , so that AW -1 can be treated as an operator. The equivalence between 



(5.1) and (5.3) was first established by Herman, Lent, and Hurwitz [T3] . 

In most applications of regression analysis, the amount of regularization, e.g., 
the optimal regularization parameter, is unknown and thus determined by cross- 
validation. This requires solving a sequence of LS problems where only W differs. 
For over-determined problems, we only need to perform a random normal projection 
on A once. The marginal cost to solve for each W is the following: a random normal 
projection on W, an SVD of size [771] x n, and a predictable number of iterations. 
Similar results hold for under-determined problems when each W is a multiple of the 
identity matrix. 

6. Numerical experiments. We implemented our LS solver LSRN and com- 
pared it with competing solvers: LAPACK's DGELSD, MATLAB's backslash, and 
Blendenpik by Avron, Maymounkov, and Toledo [T). MATLAB's backslash uses differ- 
ent algorithms for different problem types. For sparse rectangular systems, as stated 
by Tim DavisQ "SuiteSparseQR [1 [5] is now QR in MATLAB 7.9 and x = A\b when 
A is sparse and rectangular." Table |6.1| summarizes the properties of those solvers. 
We report our empirical results in this section. 

6.1. Implementation and system setup. The experiments were performed 
on either a local shared-memory machine or a virtual cluster hosted on Amazon's 
Elastic Compute Cloud (EC2). The shared-memory machine has 12 Intel Xeon CPU 
cores at clock rate 2GHz with 128GB RAM. The virtual cluster consists of 20 ml. large 
instances configured by a third-party tool called StarClusteij^] An ml. large instance 
has 2 virtual cores with 2 EC2 Compute Units^ each. To attain top performance 
on the shared-memory machine, we implemented a multi-threaded version of LSRN in 
C, and to make our solver general enough to handle large problems on clusters, we 
also implemented an MPI version of LSRN in Python with NumPy, SciPy, and mpi4py. 
Both packages are available for download^] We use the multi-threaded implementation 
to compare LSRN with other LS solvers and use the MPI implementation to explore 
scalability and to compare iterative solvers under a cluster environment. To generate 
values from the standard normal distribution, we adopted the code from Marsaglia 
and Tsang [TB] and modified it to use threads; this can generate a billion samples in 



1 http : //www. cise .uf 1 . edu/research/sparse/SPQR/ 
2 http : / /web .mit . edu/stardev/cluster/ 

Une HVZ Compute Unit provides the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron 
or 2007 Xeon processor." from http://aws.amazon.com/ec2/faqs/ 
4 http : / /www. Stanford. edu/group/SQL/ sof tware/lsrn.html 
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Fig. 6.1. Left: vs. k(AN) for different choices of r and s. A £ R 10 xl ° is randomly 

generated with rank r. For each (r,s) pair, we take the largest value of k(AN) in 10 independent 
runs for each k+(A) and connect them using a solid line. The estimate (1 + wr/s)/(l — ^/r/s) is 
drawn in a dotted line for each (r, s) pair, if not overlapped with the corresponding solid line. Right: 
number of LSQR iterations vs. r/s. The number of LSQR iterations is merely a function ofr/s, 
independent of the condition number of the original system. 



less than two seconds on the shared-memory machine. We also modified Blendenpik 
to call multi-threaded FFTW routines. Blendenpik's default settings were used, i.e., 
using randomized discrete Fourier transform and sampling 4min(m, n) rows/columns. 
All LAPACK's LS solvers, Blendenpik, and LSRN are linked against MATLAB's own 
multi-threaded BLAS and LAPACK libraries. So, in general, this is a fair setup 
because all the solvers can use multi-threading automatically and are linked against 
the same BLAS and LAPACK libraries. The running times were measured in wall- 
clock times. 



6.2. k(AN) and number of iterations. Recall that Theorem 4.4 states that 
k(AN), the condition number of the preconditioned system, is roughly bounded by 
(1 + \Jr/s)/(l — \Jr/s) when s is large enough such that we can ignore a in practice. 
To verify this statement, we generate random matrices of size 10 4 x 10 3 with condition 



numbers ranged from 10 2 to 10 s . The left figure in Figure 6.1 compares n(AN) with 
k + (A), the effective condition number of A, under different choices of s and r. We take 
the largest value of k(AN) in 10 independent runs as the k(AN) in the plot. For each 
pair of s and r, the corresponding estimate (l+yV/s)/(l— \/ r /s) is drawn in a dotted 
line of the same color, if not overlapped with the solid line of k(AN). We see that 
(1 + Wr/s)/(l — y/r/s) is indeed an accurate estimate of the upper bound on k(AN). 
Moreover, k(AN) is not only independent of n + (A), but it is also quite small. For 
example, we have (1 + y/rjs) /(l — \frjs) < 6 if s > 2r, and hence we can expect super 
fast convergence of CG-likc methods. Based on Theorem 4.5 , the number of iterations 
should be less than (log e — log 2) / log \/r/s, where e is a given tolerance. In orde r to 
match the accuracy of direct solvers, we set e = lCT 14 . The right fi gure in Figure 6.1 



shows the number of LSQR iterations for different combinations of r/s and k + (A). 
Again, we take the largest iteration number in 10 independent runs for each pair of 
r/s and k+{A). We also draw the theoretical upper bound (logs — log 2)/ log \frj~s 
in a dotted line. We see that the number of iterations is basically a function of r/s, 
independent of k+(A), and the theoretical upper bound is very good in practice. This 
confirms that the number of iterations is fully predictable given 7. 
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Fig. 6.2. The overall running time of LSRN and the running time of each LSRN stage with 
different oversampling factor 7 for a randomly generated problem of size 10 s X 10 3 . For this particular 
problem, the optimal 7 that minimizes the overall running time lies in [1.8,2.2]. 



6.3. Tuning the oversampling factor 7. Once we set the tolerance and max- 
imum number of iterations, there is only one parameter left: the oversampling factor 
7. To demonstrate the impact of 7, we fix problem size to 10 5 x 10 3 and condition 
number to 10 6 , set the tolerance to 10 -14 , and then solve the problem with 7 ranged 
from 1.2 to 3. Figure [6T2| illustrates how 7 affects the running times of LSRN's stages: 
randn for generating random numbers, mult for computing A = GA, svd for comput- 
ing £ and V from A, and iter for LSQR. We see that, the running times of randn, 
mult, and svd increase linearly as 7 increases, while iter time decreases. Therefore 
there exists an optimal choice of 7. For this particular problem, we should choose 7 
between 1.8 and 2.2. We experimented with various LS problems. The best choice of 
7 ranges from 1.6 to 2.5, depending on the type and the size of the problem. We also 
note that, when 7 is given, the running time of the iteration stage is fully predictable. 
Thus we can initialize LSRN by measuring randn/sec and flops/sec for matrix- vector 
multiplication, matrix-matrix multiplication, and SVD, and then determine the best 
value of 7 by minimizing the total running time (4.8). For simplicity, we set 7 = 2.0 
in all later experiments; although this is not the optimal setting for all cases, it is 
always a reasonable choice. 

6.4. Dense least squares. As the state-of-the-art dense linear algebra library, 
LAPACK provides several routines for solving LS problems, e.g., DGELS, DGELSY, 
and DGELSD. DGELS uses QR factorization without pivoting, which cannot handle 
rank-deficient problems. DGELSY uses QR factorization with pivoting, which is 
more reliable than DGELS on rank-deficient problems. DGELSD uses SVD. It is the 
most reliable routine, and should be the most expensive as well. However, we find 
that DGELSD actually runs much faster than DGELSY on strongly over- or under- 
determined systems on the shared-memory machine. It may be because of better use 
of multi-threaded BLAS, but we don't have a definitive explanation. 

Figure |6.3| compares the running times of LSRN and competing solvers on ran- 
domly generated full-rank dense strongly over- or under-determined problems. We 
set the condition numbers to 10 6 for all problems. Note that DGELS and DGELSD 
almost overlapped. The results show that Blendenpik is the winner. For small-sized 
problems (m < 3e4), the follow-ups are DGELS and DGELSD. When the problem size 
goes larger, LSRN becomes faster than DGELS/DGELSD. DGELSY is always slower 
than DGELS/DGELSD, but still faster than MATLAB's backslash. The performance 
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Fig. 6.3. Running times on m X 1000 dense over- determined problems with full rank (left) 
and on 1000 X n dense under- determined problems with full rank (right). Note that DGELS and 
DGELSD almost overlap. When m > 3e4, we have Blendenpik > LSRN > DGELS/DGELSD > 
DGELSY > A\b in terms of speed. On under- determined problems, LAPACK's performance de- 
creases significantly compared with the over- determined cases. Blendenpik's performance decreases 
as well. LSRN doesn't change much. 




Fig. 6.4. Running times on mx 1000 dense over- determined problems with rank 800 (left) and 
on 1000 X n dense under- determined problems with rank 800 (right). LSRN takes advantage of rank 
deficiency. We have LSRN > DGSLS/DGELSD > DGELSY in terms of speed. 



of LAPACK's solvers decreases significantly for under-determined problems. We mon- 
itored CPU usage and found that they couldn't fully use all the CPU cores, i.e., they 
couldn't effectively call multi-threaded BLAS. Though still the best, the performance 
of Blendenpik also decreases. LSRN's performance does not change much. 

LSRN is also capable of solving rank-dchcient problems, and in fact it takes advan- 



tage of any rank-dehciency (in that it hnds a solution in fewer iterations). Figure 6.4 



shows the results on over- and under-determined rank-deficient problems generated 
the same way as in previous experiments, except that we set r = 800. DGELSY and 
DGELSD remain the same speed on over-determined problems as in full-rank cases, 
respectively, and run slightly faster on under-determined problems. LSRN's running 
times reduce to 93 seconds on the problem of size 10 6 x 10 3 , from 100 seconds on its 
full-rank counterpart. 

We see that, for strongly over- or under-determined problems, DGELSD is the 
fastest and most reliable routine among the LS solvers provided by LAPACK. How- 
ever, it (or any other LAPACK solver) runs much slower on under-determined prob- 
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Fig. 6.5. Running times onmx 1000 sparse over-determined problems with full rank (left) and 
on 1000 X n sparse under- determined problems with full rank (right). DGELS and DGELSD overlap 
with each other. LAPACK's solvers and Blendenpik perform almost the same as in the dense case. 
Matlab 's backslash speeds up on sparse problems, and performs a little better than Blendenpik, but 
it is still slower than LSRN. LSRN leads by a huge margin on under- determined problems as well. 



lems than on over-determined problems, while LSRN works symmetrically on both 
cases. Blendenpik is the fastest dense least squares solver in our tests. Though it is 
not designed for solving rank-deficient problems, Blendenpik should be modifiable to 
handle such problems following Theorem |3.2| We also note that Blendenpik's perfor- 
mance depends on the distribution of the row norms of U . We generate test problems 
randomly so that the row norms of U are homogeneous, which is ideal for Blendenpik. 
When the row norms of U are heterogeneous, Blendenpik's performance may drop. 
See Avron, Maymounkov, and Toledo [T] for a more detailed analysis. 

6.5. Sparse least squares. In LSRN, A is only involved in the computation of 
matrix-vector and matrix-matrix multiplications. Therefore LSRN accelerates auto- 
matically when A is sparse, without exploring ^4's sparsity pattern. LAPACK does 
not have any direct sparse LS solver. MATLAB 's backslash uses SuiteSparseQR by 
Tim Davis [5] when A is sparse and rectangular; this requires explicit knowledge of 
A's sparsity pattern to obtain a sparse QR factorization. 

We generated sparse LS problems using MATLAB's "sprandn" function with 
density 0.01 and condition number 10 6 . All problems have full rank. Figure 6.5 
shows the results on over-determined problems. LAPACK's solvers and Blendenpik 
basically perform the same as in the dense case. DGELSY is the slowest among the 
three. DGELS and DGELSD still overlap with each other, faster than DGELSY but 
slower than Blendenpik. We see that MATLAB's backslash handles sparse problems 
very well. On the 10 6 x 10 3 problem, backslash's running time reduces to 55 seconds, 
from 273 seconds on the dense counterpart. The overall performance of MATLAB's 
backslash is better than Blendenpik's. LSRN's curve is very flat. For small problems 
(to < 10 5 ), LSRN is slow. When m > 10 5 , LSRN becomes the fastest solver among the 
six. LSRN takes only 23 seconds on the over-determined problem of size 10 6 x 10 3 . On 
large under-determined problems, LSRN still leads by a huge margin. 

LSRN makes no distinction between dense and sparse problems. The speedup 
on sparse problems is due to faster matrix-vector and matrix-matrix multiplications. 
Hence, although no test was performed, we expect a similar speedup on fast linear 
operators as well. Also note that, in the multi-threaded implementation of LSRN, 
we use a naive multi-threaded routine for sparse matrix-vector and matrix-matrix 
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Table 6.2 

Real-world problems and corresponding running times in seconds. DGELSD doesn't take ad- 
vantage of sparsity. Though MATLAB's backslash (SuiteSparseQR) may not give the min-length 
solutions to rank- deficient or under- determined problems, we still report its running times. Blenden- 
pik either doesn't apply to rank-deficient problems or runs out of memory (OOM). LSRN's running 
time is mainly determined by the problem size and the sparsity. 



matrix 


m 


n 


nnz 


rank 


cond 


DGELSD 


A\b 


Blcndcnpik 


LSRN 


landmark 


71952 


2704 


1.15e6 


2671 


1.0e8 


29.54 


0.6498* 




17.55 


rail4284 


4284 


l.le6 


l.le7 


full 


400.0 


> 3600 


1.203* 


OOM 


136.0 


tnimg_l 


951 


lo6 


2.1e7 


925 




630.6 


1067* 




36.02 


tnimg_2 


1000 


2o6 


4.2e7 


981 




1291 


> 3600* 




72.05 


tnimg_3 


1018 


3o6 


6.3e7 


1016 




2084 


> 3600* 




111.1 


tnimg_4 


1019 


4o6 


8.4e7 


1018 




2945 


> 3600* 




147.1 


tnimg_5 


1023 


5c6 


l.lc8 


full 




> 3600 


> 3600* 


OOM 


188.5 



multiplications, which is far from optimized and thus leaves room for improvement. 

6.6. Real-world problems. In this section, we report results on some real- 
world large data problems. The problems are summarized in Table |6.2[ along with 
running times. 

landmark and rail4284 are from the University of Florida Sparse Matrix Col- 
lection [BJ. landmark originated from a rank-deficient LS problem. rail4284 has full 
rank and originated from a linear programming problem on Italian railways. Both 
matrices are very sparse and have structured patterns. MATLAB's backslash (SuiteS- 
parseQR) runs extremely fast on these two problems, though it doesn't guarantee 
to return the min-length solution. Blcndcnpik is not designed to handle the rank- 
deficient landmark, and it unfortunately runs out of memory (OOM) on rail4284. 
LSRN takes 17.55 seconds on landmark and 136.0 seconds on rail4284. DGELSD is 
slightly slower than LSRN on landmark and much slower on rail4284. 

tnimg is generated from the Tinylmages collection [23] , which provides 80 million 
color images of size 32 x 32. For each image, we first convert it to grayscale, compute 
its two-dimensional DCT, and then only keep the top 2% largest coefficients in mag- 
nitude. This gives a sparse matrix of size 1024 x 8e7 where each column has 20 or 21 
nonzero elements. Note that tnimg doesn't have apparent structured pattern. Since 
the whole matrix is too big, we work on submatriccs of different sizes. tnimgJ is the 
submatrix consisting of the first 10 6 x i columns of the whole matrix for i = 1, . . . , 80, 
where empty rows are removed. The running times of LSRN are approximately linear 
in n. Both DGELSD and MATLAB's backslash are very slow on the tnimg problems. 
Blendenpik either doesn't apply to the rank-deficient cases or runs OOM. 

We see that, though both methods taking advantage of sparsity, MATLAB's 
backslash relies heavily on the sparsity pattern, and its performance is unpredictable 
until the sparsity pattern is analyzed, while LSRN doesn't rely on the sparsity pattern 
and always delivers predictable performance and, moreover, the min-length solution. 

6.7. Scalability and choice of iterative solvers on clusters. In this section, 
we move to the Amazon EC2 cluster. The goals are to demonstrate that (1) LSRN 
scales well on clusters, and (2) the CS method is preferred to LSQR on clusters with 
high communication cost. The test problems are submatrices of the tnimg matrix in 
the previous section: tnimg_4, tnimgTO, tnimg_20, and tnimg_40, solved with 4, 10, 
20, and 40 cores respectively. Each process stores a submatrix of size 1024 x le6. Table 



6.3 shows the results, averaged over 5 runs. Ideally, from the complexity analysis (4.8 ), 
when we double n and double the number of cores, the increase in running time should 
be a constant if the cluster is homogeneous and has perfect load balancing (which we 
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Table 6.3 

Test problems on the Amazon EC2 cluster and corresponding running times in seconds. When 
we enlarge the problem scale by a factor of 10 and increase the number of cores accordingly, the 
running time only increases by a factor of 50%. It shows LSRN's good scalability. Though the CS 
method takes more iterations, it is faster than LSQR by saving communication cost. 



solver 


■*V nodes 


rip 


matrix 


m 


n 


nnz 


JVitor 


Titer 


Ttotal 


LSRN w/ CS 
LSRN w/ LSQR 


2 


4 


tnimg_4 


1024 


4e6 


8.4e7 


106 
84 


34.03 
41.14 


170.4 
178.6 


LSRN w/ CS 
LSRN w/ LSQR 


5 


10 


tnimg_10 


1024 


le7 


2.1e8 


106 

84 


50.37 
68.72 


193.3 
211.6 


LSRN w/ CS 
LSRN w/ LSQR 


10 


20 


tnimg_20 


1024 


2e7 


4.2e8 


106 

84 
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have observed is not true on Amazon EC2). For LSRN with CS, from tnimg_10 to 
tnimg_20 the running time increases 27.6 seconds, and from tnimg_20 to tnimg_40 
the running time increases 34.7 seconds. We believe the difference between the time 
increases is caused by the heterogeneity of the cluster, because Amazon EC2 doesn't 
guarantee the connection speed among nodes. From tnimg_4 to tnimg_40, the problem 
scale is enlarged by a factor of 10 while the running time only increases by a factor 
of 50%. The result still demonstrates LSRN's good scalability. We also compare the 
performance of LSQR and CS as the iterative solvers in LSRN. For all problems LSQR 
converges in 84 iterations and CS converges in 106 iterations. However, LSQR is slower 
than CS. The communication cost saved by CS is significant on those tests. As a result, 
we recommend CS as the default LSRN iterative solver for cluster environments. Note 
that to reduce the communication cost on a cluster, we could also consider increasing 
7 to reduce the number of iterations. 

7. Conclusion. We developed LSRN, a parallel solver for strongly over- or under- 
determined, and possibly rank-deficient, systems. LSRN uses random normal projec- 
tion to compute a preconditioner matrix for an iterative solver such as LSQR and 
the Chebyshev semi-iterative (CS) method. The preconditioning process is embar- 
rassingly parallel and automatically speeds up on sparse matrices and fast linear 
operators, and on rank-deficient data. We proved that the preconditioned system is 
consistent and extremely well-conditioned, and derived strong bounds on the number 
of iterations of LSQR or the CS method, and hence on the total running time. On 
large dense systems, LSRN is competitive with the best existing solvers, and it runs sig- 
nificantly faster than competing solvers on strongly over- or under-determined sparse 
systems. LSRN is easy to implement using threads or MPI, and it scales well in parallel 
environments. 
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